Citation: This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.
Question: Which chemical properties influence the quality of white wines? This analysis explores the variables in the white wine dataset to identify them.
General exploration
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
The dataset has 4,898 rows and 12 variables.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
Some variables (residual sugar, free sulfur dioxide, total sulfur dioxide) have maximum values far above their medians, which suggests that the dataset contains outliers.
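One way to make the outlier suspicion concrete is Tukey's 1.5 × IQR rule. This is a sketch under my own assumption, not a rule used in the original analysis. It uses only the quartiles and maxima printed in the summary above, and the data frame `stats` and its column names are made up for illustration.

```r
# Quartiles and maxima copied from the summary() output above.
stats <- data.frame(
  variable = c("residual.sugar", "free.sulfur.dioxide", "total.sulfur.dioxide"),
  q1  = c(1.7, 23, 108),
  q3  = c(9.9, 46, 167),
  max = c(65.8, 289, 440)
)

# Tukey's rule (an assumption, not the original method):
# values above Q3 + 1.5 * IQR are flagged as potential outliers.
stats$iqr         <- stats$q3 - stats$q1
stats$upper_fence <- stats$q3 + 1.5 * stats$iqr
stats$max_flagged <- stats$max > stats$upper_fence

print(stats)
```

For all three variables the maximum lies well above the upper fence, which is consistent with the reading of the summary table.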
Quality exploration
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
The most common quality score is 6 (2,198 of the 4,898 wines). Very few wines in the dataset are rated very good (8 or 9) or very bad (3 or 4).
The quality distribution looks very close to a normal distribution.
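The shape of the quality distribution can be checked from the frequency table alone. The sketch below rebuilds the counts printed above (the variable names are my own) and derives the share of each score and the mean quality.

```r
# Counts per quality score, copied from the table() output above.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)
scores <- as.integer(names(counts))

# Percentage of wines at each score.
share <- round(100 * counts / sum(counts), 1)

# Weighted mean of the scores; should agree with the summary() mean.
mean_quality <- sum(scores * counts) / sum(counts)

# Fraction of wines with a score of 5, 6 or 7.
middle_share <- sum(counts[c("5", "6", "7")]) / sum(counts)

print(share)
print(round(mean_quality, 3))
print(round(middle_share, 3))
```

The weighted mean reproduces the 5.878 reported by `summary()`, and scores 5 to 7 account for the large majority of wines.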
Fixed acidity exploration
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
The middle 50% of wines have fixed acidity between 6.3 and 7.3 g/dm^3, with a median of 6.8 g/dm^3 and a mean of 6.855 g/dm^3. Fixed acidity appears to follow a roughly normal distribution.
Volatile acidity exploration
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
The middle 50% of wines have volatile acidity between 0.21 and 0.32 g/dm^3, with a median of 0.26 g/dm^3 and a mean of 0.2782 g/dm^3. Volatile acidity appears to follow a roughly normal distribution.
Citric acid exploration
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
The middle 50% of wines have citric acid between 0.27 and 0.39 g/dm^3, with a median of 0.32 g/dm^3 and a mean of 0.3342 g/dm^3. Citric acid appears to follow a roughly normal distribution, but with unusual spikes at 0.49 g/dm^3 and 0.74 g/dm^3.
Residual sugar exploration
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The middle 50% of wines have residual sugar between 1.7 and 9.9 g/dm^3, with a median of 5.2 g/dm^3 and a mean of 6.39 g/dm^3. The distribution is right-skewed, so I’d like to see the distribution of the transformed variable.
The transformed residual sugar distribution appears bimodal. There might be two groups of wine, one sweeter than the other.
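The effect of a log transformation on the long right tail can be illustrated with the five-number summary printed above. This is only a rough sketch: `tail_ratio` is a helper I made up, and comparing tail lengths is a crude stand-in for looking at the full histogram.

```r
# Five-number summary of residual sugar, copied from the summary() output.
five_num <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)

# Hypothetical helper: length of the upper tail relative to the lower tail.
# A value near 1 means the two tails are about equally long.
tail_ratio <- function(s) {
  unname((s["max"] - s["median"]) / (s["median"] - s["min"]))
}

raw_ratio <- tail_ratio(five_num)         # on the original scale
log_ratio <- tail_ratio(log10(five_num))  # after a log10 transformation

print(c(raw = raw_ratio, log10 = log_ratio))

# On the full data, the transformed histogram could be drawn with, e.g.:
# hist(log10(wine$residual.sugar))
```

On the original scale the upper tail is more than ten times longer than the lower one; after `log10` the two tails are of similar length, which is why the transformed histogram is easier to read.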
Chlorides exploration
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
The middle 50% of wines have chlorides between 0.036 and 0.05 g/dm^3, with a median of 0.043 g/dm^3 and a mean of 0.0458 g/dm^3. Chlorides appear to follow a roughly normal distribution, but with outliers above 0.1 g/dm^3.
Free sulfur dioxide exploration
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
The middle 50% of wines have free sulfur dioxide between 23 and 46 mg/dm^3, with a median of 34 mg/dm^3 and a mean of 35.31 mg/dm^3. Free sulfur dioxide appears to follow a roughly normal distribution.
Total sulfur dioxide exploration
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
The middle 50% of wines have total sulfur dioxide between 108 and 167 mg/dm^3, with a median of 134 mg/dm^3 and a mean of 138.4 mg/dm^3. Total sulfur dioxide appears to follow a roughly normal distribution.
Density exploration
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Almost all wines have a density close to 1 g/cm^3.
pH exploration
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The middle 50% of wines have a pH between 3.09 and 3.28, with a median of 3.18 and a mean of 3.188. The pH appears to follow a roughly normal distribution.
Sulphates exploration
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
The middle 50% of wines have sulphates between 0.41 and 0.55 g/dm^3, with a median of 0.47 g/dm^3 and a mean of 0.49 g/dm^3. Sulphates appear to follow a roughly normal distribution.
Alcohol exploration
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The middle 50% of wines have alcohol between 9.5% and 11.4% by volume, with a median of 10.4% and a mean of 10.51%. Alcohol appears to follow a roughly normal distribution.
There are 4,898 white wines with 11 input variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol) plus the quality score.
Other observations:
Quality is the dependent variable and will be the main feature. I’d like to find the features that best predict the quality of the wine. Currently, I guess the
may be the key factors to determine wine quality.
No.
Residual sugar is right-skewed, so I applied a log transformation to it. The transformed distribution is less skewed, although it is bimodal rather than normal.
Correlation
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.02269729 0.289180698
## volatile.acidity -0.02269729 1.00000000 -0.149471811
## citric.acid 0.28918070 -0.14947181 1.000000000
## residual.sugar 0.08902070 0.06428606 0.094211624
## chlorides 0.02308564 0.07051157 0.114364448
## free.sulfur.dioxide -0.04939586 -0.09701194 0.094077221
## total.sulfur.dioxide 0.09106976 0.08926050 0.121130798
## density 0.26533101 0.02711385 0.149502571
## pH -0.42585829 -0.03191537 -0.163748211
## sulphates -0.01714299 -0.03572815 0.062330940
## alcohol -0.12088112 0.06771794 -0.075728730
## quality -0.11366283 -0.19472297 -0.009209091
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.08902070 0.02308564 -0.0493958591
## volatile.acidity 0.06428606 0.07051157 -0.0970119393
## citric.acid 0.09421162 0.11436445 0.0940772210
## residual.sugar 1.00000000 0.08868454 0.2990983537
## chlorides 0.08868454 1.00000000 0.1013923521
## free.sulfur.dioxide 0.29909835 0.10139235 1.0000000000
## total.sulfur.dioxide 0.40143931 0.19891030 0.6155009650
## density 0.83896645 0.25721132 0.2942104109
## pH -0.19413345 -0.09043946 -0.0006177961
## sulphates -0.02666437 0.01676288 0.0592172458
## alcohol -0.45063122 -0.36018871 -0.2501039415
## quality -0.09757683 -0.20993441 0.0081580671
## total.sulfur.dioxide density pH
## fixed.acidity 0.091069756 0.26533101 -0.4258582910
## volatile.acidity 0.089260504 0.02711385 -0.0319153683
## citric.acid 0.121130798 0.14950257 -0.1637482114
## residual.sugar 0.401439311 0.83896645 -0.1941334540
## chlorides 0.198910300 0.25721132 -0.0904394560
## free.sulfur.dioxide 0.615500965 0.29421041 -0.0006177961
## total.sulfur.dioxide 1.000000000 0.52988132 0.0023209718
## density 0.529881324 1.00000000 -0.0935914935
## pH 0.002320972 -0.09359149 1.0000000000
## sulphates 0.134562367 0.07449315 0.1559514973
## alcohol -0.448892102 -0.78013762 0.1214320987
## quality -0.174737218 -0.30712331 0.0994272457
## sulphates alcohol quality
## fixed.acidity -0.01714299 -0.12088112 -0.113662831
## volatile.acidity -0.03572815 0.06771794 -0.194722969
## citric.acid 0.06233094 -0.07572873 -0.009209091
## residual.sugar -0.02666437 -0.45063122 -0.097576829
## chlorides 0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide 0.05921725 -0.25010394 0.008158067
## total.sulfur.dioxide 0.13456237 -0.44889210 -0.174737218
## density 0.07449315 -0.78013762 -0.307123313
## pH 0.15595150 0.12143210 0.099427246
## sulphates 1.00000000 -0.01743277 0.053677877
## alcohol -0.01743277 1.00000000 0.435574715
## quality 0.05367788 0.43557472 1.000000000
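The correlation matrix shows several relationships between variables. The strongest are density with residual sugar (0.839), density with alcohol (-0.780), and free with total sulfur dioxide (0.616). Quality correlates most with alcohol (0.436) and density (-0.307).
As mentioned earlier, there appear to be some outliers in the dataset, so I’ll remove the top and bottom 1% of the data for the following analysis.
Since quality is the target variable and is moderately correlated with alcohol and density, I want to explore these two variables first.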
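To read the strongest relationships off the matrix without scanning every cell, the pairs can be ranked by absolute correlation. The sketch below does this for a four-variable excerpt, with the values copied from the matrix above. The names `cor_sub` and `ranked` are my own; on the full data the same steps would apply to `cor(wine)`.

```r
# Excerpt of the correlation matrix above (values copied from the output).
vars <- c("alcohol", "density", "residual.sugar", "quality")
cor_sub <- matrix(
  c( 1.00000000, -0.78013762, -0.45063122,  0.43557472,
    -0.78013762,  1.00000000,  0.83896645, -0.30712331,
    -0.45063122,  0.83896645,  1.00000000, -0.09757683,
     0.43557472, -0.30712331, -0.09757683,  1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep each pair once by taking only the upper triangle.
idx <- which(upper.tri(cor_sub), arr.ind = TRUE)
ranked <- data.frame(
  var1 = rownames(cor_sub)[idx[, "row"]],
  var2 = colnames(cor_sub)[idx[, "col"]],
  r    = cor_sub[idx]
)

# Sort by absolute correlation, strongest first.
ranked <- ranked[order(-abs(ranked$r)), ]
print(ranked, row.names = FALSE)
```

Within this excerpt, density and residual sugar form the strongest pair, followed by density and alcohol.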
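A minimal sketch of the 1% trimming step is shown below. The text does not say which variable or variables were trimmed, so the helper `trim_quantiles` and its arguments are hypothetical, and it is demonstrated on a made-up column rather than on the wine data.

```r
# Hypothetical helper: keep rows whose value in `column` lies between the
# `lower` and `upper` quantiles (1% and 99% by default).
trim_quantiles <- function(df, column, lower = 0.01, upper = 0.99) {
  bounds <- quantile(df[[column]], probs = c(lower, upper), na.rm = TRUE)
  # which() drops rows where the comparison is NA.
  keep <- which(df[[column]] >= bounds[1] & df[[column]] <= bounds[2])
  df[keep, , drop = FALSE]
}

# Demonstration on made-up data: the integers 1 to 1000.
demo    <- data.frame(x = 1:1000)
trimmed <- trim_quantiles(demo, "x")

print(nrow(trimmed))
print(range(trimmed$x))

# On the wine data this might look like, for example:
# wine_trimmed <- trim_quantiles(wine, "residual.sugar")
```

Trimming one variable at a time keeps the rule easy to reason about. Trimming several variables in sequence removes more than 2% of the rows in total.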
Alcohol
The scatter plot has many overlapping points, so I will adjust the alpha to make the chart clearer.
Both the scatter plot with alpha adjustment and the boxplot suggest that higher quality wines usually have slightly higher alcohol. Alcohol might be one of the key factors for wine quality.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.55 10.45 10.35 11.00 12.60
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.40 10.10 10.15 10.75 13.50
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.64 12.60 14.00
## --------------------------------------------------------
## wine$quality: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
Another takeaway is that for wines with quality below 6, the median alcohol is below 10.5%, while for wines with quality above 6 the median alcohol is 11.4% or higher.
In the frequency plot, high quality wines skew towards high alcohol values while low quality wines skew towards low alcohol values.
Density
Both the scatter plot with alpha adjustment and the boxplot suggest that higher quality wines usually have slightly lower density.
Besides alcohol and density, I’d like to take a look at other variables of interest (volatile acidity, free sulfur dioxide, citric acid, and residual sugar).
Volatile acidity
It seems that higher quality wines usually have lower volatile acidity.
Citric acid
Citric acid doesn’t appear to have much impact on quality.
Residual sugar
Residual sugar doesn’t appear to have much impact on quality.
Free sulfur dioxide
Free sulfur dioxide doesn’t appear to have much impact on quality.
Density and residual.sugar have the strongest correlation of 0.839.
Alcohol
Since alcohol alone cannot predict quality very well, I’d like to combine it with other variables to find more insights. Before going further, I created two buckets for quality: one with lower quality (2 - 6) and one with higher quality (7 - 10). The scores observed in the data range from 3 to 9.
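The two quality buckets could be built with `cut()`, as in the sketch below. The break points follow the ranges given above. The labels and variable names are my own assumptions, and the quality vector is rebuilt from the frequency table rather than read from the data file.

```r
# Rebuild the quality column from the counts in the frequency table above.
counts  <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
             "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(counts)), times = counts)

# cut() uses right-closed intervals by default: (2, 6] and (6, 10].
quality_group <- cut(quality,
                     breaks = c(2, 6, 10),
                     labels = c("lower", "higher"))

group_sizes <- table(quality_group)
print(group_sizes)
```

The two groups are quite unbalanced, which is worth keeping in mind when comparing them in the plots that follow.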
First, I’ll include density to see how the plot changes.
The scatter plot shows a negative relationship between alcohol and density, so high quality wines tend to have higher alcohol and lower density.
Then I added volatile acidity, which can cause an unpleasant smell at high levels, to the chart to look for a pattern. The scatter plot shows that higher quality wines are usually found in the higher alcohol range. However, there is no specific trend for volatile acidity across the quality groups.
The plot did not show much evidence that higher quality wines have lower volatile acidity, as I had expected.
Sugar seems to vary less among high quality wines, where it usually stays under 10 g/dm^3.
The plot did not show much evidence that free sulfur dioxide has an impact on wine quality.
The plot did not show much evidence that citric acid has an impact on wine quality.
I also want to see how density interacts with other variables in the following analysis.
Density
First, I included residual sugar. At the same density, high quality wines appear to have more sugar than low quality wines.
The plot did not show much evidence that total sulfur dioxide has an impact on wine quality.
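library(memisc)  # provides mtable() for side-by-side model comparison

# Start from alcohol alone, then add one predictor at a time.
m1 <- lm(quality ~ alcohol, data = wine)
m2 <- update(m1, ~ . + density)
m3 <- update(m2, ~ . + residual.sugar)
m4 <- update(m3, ~ . + citric.acid)
m5 <- update(m4, ~ . + free.sulfur.dioxide)
m6 <- update(m5, ~ . + volatile.acidity)
# Compare coefficients and fit statistics across the nested models.
mtable(m1, m2, m3, m4, m5, m6, sdigits = 3)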
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + density, data = wine)
## m3: lm(formula = quality ~ alcohol + density + residual.sugar, data = wine)
## m4: lm(formula = quality ~ alcohol + density + residual.sugar + citric.acid,
## data = wine)
## m5: lm(formula = quality ~ alcohol + density + residual.sugar + citric.acid +
## free.sulfur.dioxide, data = wine)
## m6: lm(formula = quality ~ alcohol + density + residual.sugar + citric.acid +
## free.sulfur.dioxide + volatile.acidity, data = wine)
##
## ====================================================================================================
##                               m1           m2           m3           m4           m5           m6
## ----------------------------------------------------------------------------------------------------
## (Intercept)                2.582***   -22.492***    90.313***    95.983***    89.836***    68.954***
##                           (0.098)      (6.165)     (12.374)     (12.558)     (12.519)     (12.174)
## alcohol                    0.313***     0.360***     0.246***     0.240***     0.257***     0.300***
##                           (0.009)      (0.015)      (0.018)      (0.018)      (0.018)      (0.018)
## density                                24.728***   -87.886***   -93.619***   -87.744***   -66.481***
##                                        (6.079)     (12.317)     (12.506)     (12.465)     (12.123)
## residual.sugar                                       0.053***     0.055***     0.049***     0.047***
##                                                     (0.005)      (0.005)      (0.005)      (0.005)
## citric.acid                                                       0.246**      0.194*      -0.080
##                                                                  (0.095)      (0.095)      (0.093)
## free.sulfur.dioxide                                                            0.005***     0.004***
##                                                                               (0.001)      (0.001)
## volatile.acidity                                                                           -2.008***
##                                                                                            (0.111)
## ----------------------------------------------------------------------------------------------------
## R-squared                  0.190        0.192        0.210        0.211        0.220        0.269
## adj. R-squared             0.190        0.192        0.210        0.211        0.219        0.268
## sigma                      0.797        0.796        0.787        0.787        0.783        0.758
## F                       1146.395      583.290      434.085      327.629      275.702      300.070
## p                          0.000        0.000        0.000        0.000        0.000        0.000
## Log-likelihood         -5839.391    -5831.127    -5776.812    -5773.443    -5746.636    -5587.029
## Deviance                3112.257     3101.773     3033.737     3029.566     2996.585     2807.519
## AIC                    11684.782    11670.255    11563.624    11558.886    11507.272    11190.058
## BIC                    11704.272    11696.241    11596.107    11597.866    11552.748    11242.031
## N                           4898         4898         4898         4898         4898         4898
## ====================================================================================================
Although R^2 increases as variables are added to the model (from 0.190 for m1 to 0.269 for m6), all the variables together still explain only about 27% of the variance in quality.
Higher quality wines usually have higher alcohol. In the multivariate plots, volatile acidity and free sulfur dioxide show no clear relationship with quality, although volatile acidity has a significant negative coefficient (-2.008) in model m6.
I expected high quality wines to have less volatile acidity, but in the plots volatile acidity is spread similarly across the different quality levels.
Yes, I created linear models with different sets of variables. The dataset is already in a tidy format, which makes it convenient to build regression models. However, it lacks information such as the place or year of production, which might affect the quality.
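The fit statistics in the table can be cross-checked by hand. The sketch below recomputes the adjusted R-squared for m5 and m6 from the rounded R-squared values and compares the AIC values of the six models. Because the inputs are rounded to three decimals, the last digit may not match the table for every model. The variable names are my own.

```r
n <- 4898  # number of observations (row N in the table)

# R-squared (rounded, from the table) and number of predictors per model.
r2 <- c(m5 = 0.220, m6 = 0.269)
p  <- c(m5 = 5,     m6 = 6)

# Adjusted R-squared penalises each additional predictor.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))

# AIC values copied from the table; lower is better.
aic <- c(m1 = 11684.782, m2 = 11670.255, m3 = 11563.624,
         m4 = 11558.886, m5 = 11507.272, m6 = 11190.058)

# Change in AIC as each predictor is added (negative = improvement).
print(diff(aic))
```

With almost 5,000 observations the adjustment barely changes R-squared, and the largest single drop in AIC comes from adding volatile acidity in m6.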
My variable of interest, wine quality, is related to alcohol (correlation of 0.436) and density (correlation of -0.307). The alcohol and quality scatter plot shows an upward trend, which means that higher quality wines usually have higher alcohol. The related boxplot confirms this: the median alcohol value increases from 9.5% at quality 5 to 12.5% at quality 9. I also noticed that high quality wines (7, 8, 9) have a median alcohol of 11.4% or higher. For density, the density and quality scatter plot shows a downward trend, which indicates that lower quality wines usually have higher density. The related boxplot also shows that the median density decreases from around 0.995 g/cm^3 at quality 5 to 0.99 g/cm^3 at quality 9.
Since alcohol and density are the two variables most related to quality, I put them together and found an interesting relationship. Alcohol and density are negatively related, so wines with higher alcohol have lower density. High quality wines usually have high alcohol and also tend to have lower density.
We know from the previous plot that high quality wines tend to have lower density. The third plot also shows that low density wines usually have low residual sugar. High quality wines mostly appear at a density below 0.995 g/cm^3 and residual sugar below 10 g/dm^3.
I found it hard to focus on specific variables using univariate analysis alone, which might have led the further analysis in the wrong direction. The correlation plots helped me identify the most relevant variables, but I was not sure whether to ignore the less relevant ones. As I tried different combinations of variables in bivariate and multivariate plots, I found it useful to use the correlation information to identify trends for the different quality groups. I think I can apply this approach to future work when I don’t know where to start.